[Bug][RayJob] Sidecar mode shouldn't restart head pod when head pod is deleted #4234
Conversation
cc @Future-Outlier @rueian PTAL
Future-Outlier left a comment
Hi, @400Ping
let us know if you need any help, plz ping us
Ok, will do, thanks.
```go
	return len(pods.Items)
}, TestTimeoutMedium, 2*time.Second).Should(Equal(1))
g.Consistently(RayJob(test, rayJob.Namespace, rayJob.Name), TestTimeoutShort).
	ShouldNot(WithTransform(RayJobDeploymentStatus, Equal(rayv1.JobDeploymentStatusFailed)))
```
I think when the head Pod is deleted in the default K8sJobMode, we will get into the following block and the job deployment status will still be Failed. I'm not sure whether this is the expected behavior.
kuberay/ray-operator/controllers/ray/rayjob_controller.go
Lines 302 to 307 in 2464704
```go
if finishedAt != nil {
	rayJobInstance.Status.JobDeploymentStatus = rayv1.JobDeploymentStatusFailed
	rayJobInstance.Status.Reason = rayv1.AppFailed
	rayJobInstance.Status.Message = "Submitter completed but Ray job not found in RayCluster."
	break
}
```
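If failing in that case is indeed the intended behavior for K8sJobMode, a small hedged e2e sketch (reusing the Gomega helpers already quoted above, such as RayJob and RayJobDeploymentStatus; not part of this PR) could pin it down:

```go
// Hypothetical assertion mirroring the quoted test helpers: in default
// K8sJobMode, after the head Pod is deleted and the submitter has finished,
// the RayJob would be expected to eventually report JobDeploymentStatusFailed.
g.Eventually(RayJob(test, rayJob.Namespace, rayJob.Name), TestTimeoutMedium).
	Should(WithTransform(RayJobDeploymentStatus, Equal(rayv1.JobDeploymentStatusFailed)))
```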
I think I now understand why this logic is written this way. We currently have two scenarios:
- The RayJob loses its head pod before the ray job is submitted.
- The RayJob loses its head pod after the ray job has already been submitted.
For Scenario 1:
When the head Pod is deleted while the submitter Pod is attempting to submit the ray job, the submitter encounters a WebSocket connection failure and exits with a non-zero exit code. The K8s Job then creates a new submitter Pod to retry (the submitterBackoffLimit default is 2).
kuberay/ray-operator/controllers/ray/rayjob_controller.go
Lines 663 to 665 in dea5baa
```go
func (r *RayJobReconciler) createNewK8sJob(ctx context.Context, rayJobInstance *rayv1.RayJob, submitterTemplate corev1.PodTemplateSpec) error {
	logger := ctrl.LoggerFrom(ctx)
	submitterBackoffLimit := ptr.To[int32](2)
```
Meanwhile, the RayCluster controller recreates the head Pod and worker Pods (only sidecar mode skips the head Pod restart).
If the new head Pod becomes ready before the K8s Job reaches its backoffLimit, the new submitter Pod could successfully submit the ray job, and the RayJob could eventually transition to JobDeploymentStatusComplete with JobStatus = JobStatusSucceeded. But that requires good luck.
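To illustrate the retry budget in Scenario 1, here is a minimal sketch (not the actual KubeRay code; the helper name newSubmitterJob is made up for illustration) of a submitter Job built with client-go types and the same backoffLimit of 2:

```go
package example

import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/utils/ptr"
)

// newSubmitterJob builds a submitter Job whose Pods are retried only a few
// times; if every attempt fails before the recreated head Pod becomes ready,
// the Job is marked Failed and the RayJob cannot recover on its own.
func newSubmitterJob(name, namespace string, template corev1.PodTemplateSpec) *batchv1.Job {
	return &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{Name: name, Namespace: namespace},
		Spec: batchv1.JobSpec{
			// Same default as submitterBackoffLimit in createNewK8sJob.
			BackoffLimit: ptr.To[int32](2),
			Template:     template,
		},
	}
}
```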
For Scenario 2:
When the head Pod is deleted after the ray job has already been submitted, the K8s Job is marked as Completed because the submitter Pod exits with exit code 0, even though the ray job itself has not completed.
Then, when the controller tries to query the ray job status via GetJobInfo, it returns BadRequest (job not found), and the RayJob transitions to JobDeploymentStatusFailed with JobStatus = JobStatusFailed. (This was fixed by #3860.)
kuberay/ray-operator/controllers/ray/rayjob_controller.go
Lines 289 to 306 in dea5baa
```go
jobInfo, err := rayDashboardClient.GetJobInfo(ctx, rayJobInstance.Status.JobId)
if err != nil {
	// If the Ray job was not found, GetJobInfo returns a BadRequest error.
	if errors.IsBadRequest(err) {
		if rayJobInstance.Spec.SubmissionMode == rayv1.HTTPMode {
			logger.Info("The Ray job was not found. Submit a Ray job via an HTTP request.", "JobId", rayJobInstance.Status.JobId)
			if _, err := rayDashboardClient.SubmitJob(ctx, rayJobInstance); err != nil {
				logger.Error(err, "Failed to submit the Ray job", "JobId", rayJobInstance.Status.JobId)
				return ctrl.Result{RequeueAfter: RayJobDefaultRequeueDuration}, err
			}
			return ctrl.Result{RequeueAfter: RayJobDefaultRequeueDuration}, nil
		}
		// finishedAt will only be set if submitter finished
		if finishedAt != nil {
			rayJobInstance.Status.JobDeploymentStatus = rayv1.JobDeploymentStatusFailed
			rayJobInstance.Status.Reason = rayv1.AppFailed
			rayJobInstance.Status.Message = "Submitter completed but Ray job not found in RayCluster."
			break
```
If the error is not BadRequest, it instead goes to JobDeploymentStatusTransitionGracePeriodExceeded.
kuberay/ray-operator/controllers/ray/rayjob_controller.go
Lines 1138 to 1161 in dea5baa
```go
func checkSubmitterFinishedTimeoutAndUpdateStatusIfNeeded(ctx context.Context, rayJob *rayv1.RayJob, finishedAt *time.Time) bool {
	logger := ctrl.LoggerFrom(ctx)
	// Check if timeout is configured and submitter has finished
	if finishedAt == nil {
		return false
	}
	// Check if timeout has been exceeded
	if time.Now().Before(finishedAt.Add(DefaultSubmitterFinishedTimeout)) {
		return false
	}
	logger.Info("The RayJob has passed the submitterFinishedTimeoutSeconds. Transition the status to terminal.",
		"SubmitterFinishedTime", finishedAt,
		"SubmitterFinishedTimeoutSeconds", DefaultSubmitterFinishedTimeout.String())
	rayJob.Status.JobStatus = rayv1.JobStatusFailed
	rayJob.Status.JobDeploymentStatus = rayv1.JobDeploymentStatusFailed
	rayJob.Status.Reason = rayv1.JobDeploymentStatusTransitionGracePeriodExceeded
	rayJob.Status.Message = fmt.Sprintf("The RayJob submitter finished at %v but the ray job did not reach terminal state within %v",
		finishedAt.Format(time.DateTime), DefaultSubmitterFinishedTimeout)
	return true
}
```
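To summarize the two paths above, here is a hypothetical helper (not the actual controller code; the name classifyMissingJob and its string results are made up for illustration) condensing the decision logic discussed in this thread:

```go
package example

import "time"

// classifyMissingJob condenses the two outcomes discussed above when
// GetJobInfo cannot find the submitted Ray job.
func classifyMissingJob(isBadRequest bool, finishedAt *time.Time, now time.Time, graceTimeout time.Duration) string {
	if isBadRequest && finishedAt != nil {
		// Scenario 2: the submitter exited 0 but the job is gone from the RayCluster.
		return "JobDeploymentStatusFailed (AppFailed)"
	}
	if finishedAt != nil && !now.Before(finishedAt.Add(graceTimeout)) {
		// The submitter finished but the job never reached a terminal state in time.
		return "JobDeploymentStatusFailed (TransitionGracePeriodExceeded)"
	}
	// Otherwise keep requeueing and polling the dashboard.
	return "requeue"
}
```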
As Kai-Hsun mentioned in Slack, I think we can simply remove this line, since the behavior in Scenario 1 is not reliable.
I think it is pretty rare, and typically it will hit the backoffLimit before the new Pod is ready, so the whole cluster needs to restart. There is a lot of potential unexpected behavior here; users should not rely on it.
Future-Outlier left a comment
Hi @400Ping, do you mind resolving the comments?
Done, PTAL
win5923 left a comment
LGTM
Why are these changes needed?
In this PR, we fail the RayJob if the RayCluster has already been provisioned. However, applying this to other modes like K8sJobMode or HTTPMode would be a breaking change, so those modes are excluded.
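A minimal sketch of how that scoping could look (a hypothetical helper, not the actual diff in this PR; it assumes a rayv1.SidecarMode constant):

```go
package example

import rayv1 "github.com/ray-project/kuberay/ray-operator/apis/ray/v1"

// shouldFailOnHeadPodDeletion is a hypothetical guard: fail the RayJob only
// when the RayCluster is already provisioned and the submission mode is
// sidecar, so K8sJobMode and HTTPMode keep their current behavior.
func shouldFailOnHeadPodDeletion(rayJob *rayv1.RayJob, clusterProvisioned bool) bool {
	return clusterProvisioned && rayJob.Spec.SubmissionMode == rayv1.SidecarMode
}
```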
Related issue number
Closes #4176